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Abstract 

This  note  describes  software  implementing  the  SMART  algorithm.  SMART  gener¬ 
alizes  the  projection  pursuit  method  to  classification  and  multiple  response  regression. 
SMART  also  provides  a  more  efficient  algorithm  for  single  response  projection  pursuit 
regression. 
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SMART 

User’s  Guide 

1*  Introduction 

SMART  (Smooth  Multiple  Additive  Regression  Technique)  is  a  method  for  modeling 
a  set  of  response  variables  Yi  (1  <  *  <  q)  as  functions  of  a  set  of  predictor  variables 
Xj  (1  <  j  <  p)  based  on  matched  observations  (training  data) 

yik,y2k,--‘ iyqk,xik,X‘2k,--->xpk  (i<&<iV)  (°) 

the  models  take  the  form 

M  p 

|  j  a?2>  ’  '  ' )  ®p]  =  Y{  +  ^  ^  Pimfm(y  1  (1) 

m—l  j=l 

c 

with  Yi  =  EYi,  Efm  =  0,  Efa  =  1  and  £y=1  a?m  =  1.  The  coefficients  pim,ccjm, 
and  the  functions  fm  are  parameters  of  the  model  and  are  estimated  by  least  squares.  The 
criterion 

q  M 

£2  =  E "W  -  Yi  -  Y,  ftm/ m(c£*)f  (2) 

«=1  m—l 

is  minimized  with  respect  to  the  parameters  /?,m,  =  (o;lm  •  •  •  «Pm)  and  the  functions 

fm.  The  response  weights  Wi  (1  <  *  <  g),  specified  by  the  user,  permit  some  flexibility  in 
specification  of  the  loss  metric  (see  below).  The  expected  values  are  computed  from  the 
data  as 

N  N 

e\z\ = £  £  wk  (3) 

k=l  k=l 

where  Z  is  considered  to  be  a  random  variable  and  Zk  (1  <  k  <  N)  are  its  realized 

values  in  the  data.  The  observation  weights  Wk{l  <  k  <  N ),  specified  by  the  user,  can 
be  employed  to  assign  differing  mass  to  different  observations.  They  can  also  be  used 
to  implement  iterative  reweighting  schemes  for  robustification  or  approximate  maximum 
likelihood  fitting. 

It  should  be  noted  that  the  loss  criterion  L%  (2)  is  sensitive  to  the  relative  scales  of 
the  response  variables  Yi  as  is  true  for  any  distance  measure.  The  influence  of  each  Yi  will 
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be  in  proportion  to  its  variance  Var(Yi).  If  it  is  desired  that  each  response  variable  have 
the  same  effect  on  the  loss  criterion  one  can  set  W{  =  1/Var  (F,)  or  rescale  the  responses 
to  have  the  same  variance. 

From  (1)  it  is  seen  that  SMART  modeling  is  a  generalization  of  projection  pursuit 
regression  PPR  (Friedman  and  Stuetzle,  1981).  Each  response  variable  is  modeled  as 
a  (usually)  different  linear  combination  of  the  predictor  functions  fm-  Each  predictor 
function  is  taken  as  a  (smooth,  but  otherwise  unrestricted)  function  of  a  (usually)  different 
linear  combination  of  the  predictor  variables.  Estimates  for  the  parameters  of  the  linear 
combinations,  and  the  functions,  are  chosen  to  be  the  values  that  minimize  the  loss  criterion 
1,2  (2).  For  the  case  of  a  single  response  variable  (q  =  1)  SMART  models  have  the  same 
form  as  PPR  models  but  they  differ  from  PPR  models  in  that  SMART  chooses  estimates 
that  minimize  (2)  whereas  PPR  choooses  the  a ^  (1  <  m  <  M)  in  a  forward  stagewise 

manner.  This  can  result  in  considerably  different  models,  especially  when  there  are  high 
associations  among  the  predictor  variables. 

2.  Classification 

Classification  is  closely  related  to  regression.  Here  a  single  response  variable  Y  assumes 
several  categorical  (unorderable)  values  (cj,  eg,  •  •  • ,  cq).  The  loss  criterion  is  usually  taken 
to  be  the  misclassification  risk 

v 

R  =  J5|  mm  £  Uj  p{ *  I  *i ,  *a,  •  •  •  *p)]  (4) 

-3-q  i=  i 

where  Uj  is  the  (user  specified)  loss  for  predicting  Y  =  Cj  when  its  true  value  is  c,-  (/,-,-  =  0). 
The  conditional  probability  p(i  \  X\  •  •  •  xp)  is  the  probability  that  Y  =  c,-  given  a  particular 
set  of  values  for  the  predictor  variables  Xi  •  •  •  xp.  The  sum  in  (4)  is  simply  the  loss  for 
predicting  Y  =  Cj  given  X\  •  •  •  xp.  The  minimization  operation  provides  a  decision  rule 
that  minimizes  this  loss  at  each  set  of  predictor  values.  The  risk  is  then  the  expected  or 
average  loss  using  this  optimal  decision  rule.  The  art  of  classification  is  to  find  estimates 
of  the  conditional  probabilities  that  minimize  the  misclassification  risk. 

Defining  category  (class)  indicator  variables  for  each  observation  k  as 

*  _  f  1  ^ yk  =  Ci  1  <k<N 
*  ( 0  otherwise  1  <  *  <  q 
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one  has 


p{i  |  xi  •  •  •  xp)  =  —E[Hi  |  xx  •  •  •  xp] 

Si 


(5) 


with  iTi  the  unconditional  (prior)  probability  that  Y  —  ct-  (Hi  =  1),  s,- 


and 

S  —  'Y2  s* •  Here  6  is  the  Kronecker  delta  function 

i=l 


N 

=  wkHyk,Ci), 
fc-1 


*(«,»)  =  { 


1  if  a  =  b 
0  otherwise 


Substituting  (5)  into  (4)  one  has 


R=Ehm  i  **•"*»!]  t6) 

From  this  one  sees  that  the  optimal  decision  rule  for  a  given  set  of  predictor  values  X\--xp 
is  to  assign  Y  =  Cj*  where  J*  is  the  integer  value  (1  <  J*  <  q )  that  minimizes  the  sum  in 

(6). 

When  the  prior  probabilities  tt,-  (1  <  t  <  g)  are  unknown,  they  can  be  estimated  from 
the  data  as  =  Si/S.  Often  the  losses  Uj  axe  taken  to  be  simply  /,y  =  1  —  8(i,j).  When 
both  of  these  situations  occur  the  misclassification  risk  reduces  to  simply  the  misclassifi- 
cation  probability. 

SMART  models  the  condition  expectations  (5,  6)  in  form  given  by  (1).  Ideally  the 
parameter  and  function  estimates  should  be  chosen  to  minimize  the  misclassification  risk 
(6).  However,  as  discussed  in  Breiman,  Friedman,  Olshen  and  Stone  (1983)  (see  also  Efron, 
1978),  this  can  lead  to  difficulties  due  to  the  non-convexity  of  (6).  A  good  surrogate  is  the 
squared  error  loss  criterion  £2  (2)  with 


(7) 


3.  Modeling  Strategy. 

It  is  the  purpose  of  the  SMART  algorithm  to  minimize  £2  (2)  with  respect  to  the 
parameters  Pim,a3mf  and  functions  fm  (1  <  *  <  q,  1  <  y  <  p,  1  <  m  <  M)i  given  the 
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training  data  (0)  and  a  loss  metric  Wi  (1  <  t  <  q).  The  specifics  of  the  algorithm  are  given 
in  Appendix  1.  The  principal  task  of  the  user  is  to  choose  Af  (2)  the  number  of  predictive 
terms  comprising  the  model.  Increasing  the  number  of  terms  decreases  the  bias  (model 
specification  error)  at  the  expense  of  increasing  the  variance  of  the  (model  and  parameter) 
estimates.  Since  the  expected  squared  error,  ESE,  is  the  sum  of  these  two  effects  -  ESE 
=  (bias)2  +  variance,  there  is  an  optimal  value  for  Af.  Sample  reuse  techniques  can  be 
used  to  estimate  these  effects  -  ESE  through  cross-validation  (Stone,  1977)  and  (Geisser, 
1975),  and  variance  through  bootstrapping  (Efron,  1983).  It  is  possible  to  implement  these 
procedures  in  conjunction  with  SMART  with  the  aim  of  estimating  an  optimal  value  for 
Af  as  well  as  confidence  intervals  for  estimates. 

Since  the  variance  tends  to  increase  linearly  with  increasing  Af  while  the  (bias)2  tends 
to  drop  rapidly  for  small  (increasing)  Af,  leveling  off  to  a  slow  decrease  for  larger  Af,  a 
good  estimate  for  the  optimal  Af  value  can  usually  be  made  by  simply  inspecting  L2  vs. 
Af  for  various  values  of  Af .  That  point  at  which  a  unit  decrease  in  Af  leads  to  a  relatively 
large  increase  in  L2  (compared  to  that  for  close-by  larger  Af  values)  is  often  a  good  choice. 
Since  the  ESE  tends  to  vary  slowly  as  a  function  of  Af  in  the  region  near  the  optimal  Af 
value  (especially  on  the  side  of  increasing  Af),  the  choice  is  not  critical  provided  it  is  not 
too  small. 

For  a  given  value  of  Af,  solutions  (minimizing  L2)  may  not  be  unique.  Sometimes 
there  are  local  minima  that  can  trap  the  SMART  algorithm  thereby  masking  a  better 
global  minimum.  Such  local  minima  represent  solutions  that  are  relevant  to  larger  (higher 
Af )  models.  Solutions  are  not  necessarily  found  in  optimal  order  as  Af  is  increased.  This 
suggests  a  backwards  stepwise  model  selection  procedure. 

The  strategy  is  to  start  with  a  relatively  large  value  of  Af  (say  Af  =  Ml)  and  find 
all  models  of  size  Ml  and  less.  That  is,  solutions  that  minimize  L2  are  found  for  Af  = 
Ml,  Ml  —  1,  Ml  —  2,  •  •  • ,  1  in  order  of  decreasing  Af.  The  starting  parameter  values 
for  the  numerical  search  in  each  Af-term  model  are  the  solution  values  for  the  Af  most 
important  (out  of  Af  4-  1)  terms  of  the  previous  model.  Term  importance  is  measured  as 

9 

/»  =  £Wilftm  I  (1  <m<M)  (8) 

t'=l 
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normalized  so  that  the  most  important  term  has  unit  importance. 

(Note  that  the  variance  of  all  ftn  is  one.)  The  starting  point  for  the  minimization  of  the 
largest  model,  M  =  Ml,  is  given  by  an  Ml  term  stagewise  model  (Friedman  and  Stuetzle, 
1981). 

The  sequence  of  solutions  generated  in  this  manner  is  then  examined  by  the  user  and 
a  final  model  is  chosen  according  to  the  guidelines  above. 

4.  Relative  Importance  of  Predictor  Variables 

It  is  often  useful  to  have  an  idea  of  the  relative  importance  of  each  predictor  variable  to 
the  final  model.  For  (single  response)  linear  models  an  often  used  measure  is  the  absolute 
value  of  the  corresponding  regression  coefficient  ay  times  a  scale  measure  of  the  predictor 
variable  cry,  Jy  =  <ry  |  ary  |,  ( 1  <  j  <  p).  A  corresponding  relative  importance  measure  for 
(multiple  response)  nonlinear  models  would  be 

g  a 

Ii  =  «s  I>.  E I  f£- 1  (1  <;'<!>) 

.=  1  OJL* 

with  %  =  E[Y{  |  xi  ■  •  •  xp\.  For  SMART  models  (1)  this  becomes 

q  Af 

Ij  =  Oj  ^W,E\  £  0imaJmf'(alz)  |  (1  <  ;  <  p)  (9) 

t=i  m=l 

where  f  =  dfm/dz.  In  the  case  of  only  one  term,  M  =  1,  (9)  is  equivalent  to  /y  =  cry  | 
ay  |.  It  is  important  to  keep  in  mind  that  the  same  care  is  required  in  interpreting  (9)  as 
in  the  corresponding  interpretation  of  regression  coefficients  in  linear  models,  especially  in 
the  presence  of  high  collinearity  among  the  predictor  variables. 

5.  SMART  Software  -  Input 

SMART  software  is  implemented  as  a  collection  of  FORTRAN  subroutines.  The  user 
interface  is  provided  by  the  parameter  list  (calling  sequence)  to  some  of  these  routines. 
In  order  to  apply  SMART  modeling  it  is  necessary  to  write  a  driver  program  that  reads 
the  training  data  (0)  into  internal  storage  arrays,  sets  various  parameters  of  the  problem, 
and  then  pass  these  to  SMART  through  the  parameter  list  of  the  appropriate  SMART 
subroutine. 
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5.1  SMART  Regression 


CALL  SMARTR  (ML,  MU,  P,  Q,  N,  W,  X,  Y,  WW, 

SMOD,  NSMOD,  SP,  NSP,  DP,  NDP) 

The  first  nine  parameters  are  input  defining  the  problem  and  the  last  six  define  storage 
workspace  necessary  for  the  operation  of  the  program. 

The  first  two  parameters  ML,  MU  (type  integer)  define  the  sequence  of  models  in 
the  backwards  stepwise  model  selection  procedure  described  above  (Section  4).  The  value 
of  the  first  parameter,  ML,  defines  the  number  of  terms  (M  in  (1))  in  the  largest  model 
of  the  sequence  while  the  second  similarily  defines  the  smallest  model  in  the  sequence. 
This  smallest  MU -term  model  is  the  one  stored  (in  SMOD,  see  below)  for  later  predictive 
use.  (Also  the  predictive  functions  fm(  a^x)  are  only  returned  for  the  MU  -term  model 
and  relative  predictor  variable  importances  (9)  are  calculated  only  for  this  model.)  A  good 
strategy  is  to  initially  set  ML  reasonably  large  (subject  to  computing  time  limitations)  and 
set  MU  =  1,  thereby  generating  all  models  of  size  M  =  ML  and  less  (1).  The  particular 
model  selected  by  the  user  can  then  be  computed  and  stored  for  later  predictive  use.  Also, 
the  predictive  functions  and  relative  predictor  variable  importances  (9)  can  be  inspected 
for  this  model.  This  is  accomplished  by  rerunning  the  program  with  the  same  value  for 
ML  but  with  MU  set  to  the  size  of  the  user  selected  model. 

The  next  three  parameters  (type  integer)  define  the  size  of  the  problem: 

P  =  number  of  predictor  variables 
Q  =  number  of  response  variables 
N  =  number  of  observations. 

these  correspond  to  the  quantities  p,q,N  of  (2). 

The  next  four  parameters  (type  real)  contain  the  data  and  corresponding  weights. 
These  are  arrays  that  must  be  declared  and  appropriately  dimensioned  in  the  user  written 
driver  program: 

Real  W(N):  observation  weights 

W(K)  contains  the  weight  for  the  Kth 
observation^*  in  (3)) 


7 


Real  X(P,  N ):  predictor  data  matrix 


X(J,K )  contains  the  value  of  the  Jth 
predictor  variable  of  Kth  observation. 

Real  Y (Q,  N):  response  data  matrix 

Y(I,  K)  contains  the  value  of  the  Ith 
response  variable  of  Kth  observation. 

Real  WW{Q ):  response  weights 

WW{I)  contains  the  response  weight  for 
Ith  response  ( W{  in  (2)). 

The  final  six  parameters  define  storage  workspace  necessary  for  the  operation  of  the 
algorithm.  The  parameters  SMOD,  SP,  and  DP  are  arrays  that  must  be  declared  and 
dimensioned  in  the  calling  program.  The  quantities  NSMOD,  NSP,  and  NDP  (type  integer) 
give  the  dimensions  assigned  (by  the  user)  to  the  corresponding  three  storage  arrays  so 
that  SMART  can  check  if  the  workspace  sizes  (dimensioned  values)  are  large  enough. 

REAL  SMOD  (NSMOD):  stores  parameters  of  the  final  (M  =  MU)  term 

model.  Minimum  dimension  is 
NSMOD  >^ML{P  +  Q  +  2N)  +  Q  +  7. 

REAL  SP(NSP):  single  precision  scratch  storage 

workspace.  Minimum  dimension 
is  NSP  >.N(Q  +  15)  +  Q  ±  3 P. 

DOUBLE  PRECISION  DP(NDP):  double  precision  scratch 

storage  workspace.  Minimum  dimension 
is  NDP  >^P(P+l)/2  +  6P. 

5.2  SMART  Classification 

CALL  SMARTC  (ML,  MU,  P,  Q,  N,  W,  X,  C,  PI,  FLS, 

SMOD,  NSMOD,  SP,  NSP,  DP,  NDP) 

The  first  ten  parameters  are  input  defining  the  problem  and  the  last  six  define  storage 
workspace.  The  parameters  ML,  MU,  P,  N,  W,  X,  DP,  NDP  are  identical  to  the  corre¬ 
sponding  ones  for  regression  and  are  described  in  Section  5.1. 
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The  parameter  Q  (type  integer)  gives  the  number  of  classes  -  the  number  of  distinct 
values  for  the  categorical  response  variable,  C.  The  array  C  (type  real)  gives  the  class 
identity  of  each  observation: 

REAL  C{N ): 

C(K)  contains  the  value  of  class 
label  for  Kth  observation 
(1.0  <  C{K)  <  FLOAT (Q)),  K=1,N. 

The  next  two  parameters  (type  real)  define  the  prior  (uncondition)  probabilities  and 
the  loss  structure  for  the  classification  problem. 

Real  PI[Q):  class  priors 

PI(I)  contains  the  prior  probability  for  Ith 
class  (tt,-)  in  (5)  and  (7) 

Q 

(PI{I)  >  0,  /=  1,Q,  and  £p/(J)  =  1). 

/=  i 

Real  FLS(Q,  Q):  loss  matrix 

FLS(I,  J)  contains  the  loss  for 
misclassifying  a  class  I  observation 
as  class  J  (/,_,-  in  (4)  and  (6)). 

(FLS(I,J)  >  0,  J,  and 
FLS{I,  I)  =  0,  /  =  1,  Q,  J  =  1,  Q.) 

Often  the  prior  probabilities  (1  <  *  <  q)  are  unknown  and  are  to  be  estimated 
from  the  data  as  £,•  =  s,/S  where  s,-  is  the  sum  of  (observation)  weights  for  the  class 
i  observations  and  S  is  the  sum  of  weights  for  all  observations.  When  this  is  the  case, 
there  are  no  user  defined  prior  probabilities  and  the  PI  array  need  not  be  declared  or 
dimensioned  in  the  calling  program.  This  situation  is  indicated  by  passing  a  single  scalar 
value  BIG  in  place  of  the  PI  array  in  the  corresponding  position  in  the  parameter  list. 
The  value  of  BIG  is  a  large  number  defined  internally  to  SMART  -  see  section  5.3 

Similarly,  only  a  simple  loss  structure  is  often  desired,  namely  /tJ-  =  1  —  <5(t,;).  That 
is,  a  simple  unit  loss  for  each  misclassified  observation.  When  this  is  the  case  the  FLS 
array  need  not  be  declared  or  dimensioned  in  the  calling  program,  and  the  single  value 
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BIG  entered  its  place  in  the  parameter  sequence. 

The  storage  workspace  arrays  SMOD  and  SP,  and  their  corresponding  array  dimen¬ 
sions  NSMOD  and  NSP,  have  the  same  meaning  as  for  the  regression  problem  (Section 
5.1),  however,  the  size  of  the  dimensions  must  be  a  little  larger  for  classification: 

REAL  SMOD(NSMOD),  SP(NSP) 

NSMOD  >  ML(P  ±  Q  +  2N)  +  2Q  ±  7 
NSP  >  N(Q +15)  4- 2Q  +  3P. 
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5.3  Incidental  Parameters 


COMMON/PARMS/IFL,  LF,  SPAN,  ALPHA,  BIG 

This  labeled  common  contains  internal  parameters  of  the  algorithm  that  the  user  may 
wish  to  change.  Default  values  for  these  parameters  are  set  at  compile  time  in  a  BLOCK 
DATA  subprogram.  This  labeled  common  need  only  be  declared  in  the  user’s  calling 
program  if  he  wishes  to  change  any  of  their  values  from  the  default  settings.  This  can  be 
done  using  executable  assignment  statements  in  the  user  routine  in  which  this  common  is 
declared. 

INTEGER  IFL:  FORTRAN  file  number  for  writing 

printed  output  (Default,  IFL=6) 

If  IFL  <  0  no  printed  output 
will  be  generated. 

INTEGER  LF:  Optimizing  level  for  minimization 

algorithm,  0  <  LF  <  3  (default, 

LF=2).  This  controls  tradeoff  between 
speed  and  thoroughness  of  optimization 
algorithm  (See  Appendix  1). 

REAL  SPAN,  ALPHA:  Super  smoother  parameters.  These 

control  operation  of  smoother  used 
to  obtain  function  estimates  (See 
Friedman,  1984) 

(Default  values,  SPAN=0.0,  ALPHA 

=0.0.) 

REAL  BIG:  A  large  representative  floating 

point  number  very  much  larger  than  the 
largest  possible  (absolute)  data 
value  (Default,  BIG  —  1020) 
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6.  SMART  Software  -  Output 

The  primary  output  of  SMART  is  a  model  for  estimating  E[Yi  j  zi  •  •  •  zp]  and,  for 
classification,  a  decision  rule  (6).  A  second  level  of  output  would  be  the  parameters  of 
the  SMART  model  (1).  These  are  useful  for  interpretation  of  the  dependence  of  each  Vi- 
on  (xi  •  •  •  xp).  A  third  level  of  output  would  be  additional  diagnostic  information  (8),  (9) 
useful  for  interpretation. 

6.1  Primary  Output 

For  a  given  set  of  values  comprising  a  predictor  vector  (xi  •  •  •  xp),  the  SMART  estimate 
(1)  for  F,(xi  ••■Xp)  ~  E[Yi  |  x\  ■  •  ■  xp]  is  obtained  by  executing 
CALL  YHAT  (XT,  SMOD,  YH). 

This  statement  must  be  executed  after  calling  either  SMARTR  (Section  5.1)  or  SMARTC 
(Section  5.2).  The  array  SMOD  must  be  the  same  as  in  the  call  to  SMARTR  or  SMARTC. 
The  quantities  XT  and  Y H  have  the  following  meaning: 

Real  XT(P):  input  predictor  vector 

XT(J)  contains  the  value  of  Jth  variable,  zy. 

Real  YH(Q):  output  expected  response  values  (given  XT) 

YH(I)  contains  the  estimated  expected  value  for  Ith  response, 

Yi(xj  •  •  •  xp). 

For  classification  the  minimum  (estimated)  risk  decision  rule  (6)  is  obtained  by  exe¬ 
cuting 

CALL  CLSFY  (PI,  FLS,  YH,  SMOD,  ICL,  RSK). 

This  statement  must  be  executed  after  calling  both  SMARTC  and  YHAT.  The  quantities 
PI,  FLS  SMOD  are  described  in  Section  5.2  and  must  be  the  same  as  in  the  call  to 
SMARTC.  The  array  YH  is  the  output  response  vector  from  YHAT  (see  above,  this 
section)  giving  the  relative  class  probabilities  given  XT  =  (Xi,  X%  •  •  •  Xp ).  The  two  output 
quantities  (from  CLSFY)  are  ICL  and  RSK.  The  first,  ICL,  contains  as  its  value  the 
class  assignment  that  minimizes  the  (estimated)  misclassification  risk,  while  RSK  has  the 
estimated  value  of  this  minimum  risk.  This  second  quantity  is  useful  in  accessing  the 
relative  confidence  in  the  class  assignment. 
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SMART  modeling  (as  implemented  here)  does  not  constrain  the  values  of  the  response 
conditional  expectation  estimates.  When  these  expectations  are  interpreted  as  conditional 
probabilities  (as  in  the  case  of  classification),  it  is  useful  to  constrain  their  individual  values 
so  that  the  corresponding  conditional  probabilities  (5)  are  between  zero  and  one  and  sum 

A 

to  one.  When  these  constraints  are  violated  by  the  Y,  as  output  from  YHAT,  CLSFY 
modifies  their  values  to  satisfy  these  constraints  in  a  way  that  preserves  the  relative  values 
of  the  corresponding  probabilities  p,-  (5).  The  conditional  probability  values  are  stored 
in  YH  (replacing  the  conditional  expectation  estimates)  before  returning  to  the  calling 
program. 

6.2  Secondary  Output 

The  parameters  of  the  SMART  model  are  packed  in  the  SMOD  array  upon  return 
from  SMARTR  or  SMARTC.  Several  user  callable  subroutines  are  available  to  obtain 
them  in  a  convenient  form  under  program  control.  In  addition  some  of  these  parameters, 
M,  ctjm,  Pim  (*  =  l,q,  j  =  l,p,  m  =  1,  M)  (1),  appear  on  the  standard  (printer)  output 
file  IFL  (provided  IFL  >  0,  see  section  5.3).  The  functions  fm  (m  =  1  ,M)  (1)  do  not 
appear  on  the  standard  output  file.  They  must  be  obtained  under  program  control,  as 
described  below,  and  then  transferred  to  a  local  (installation  dependent)  graphics  library 
for  representation  on  a  graphical  output  device.  These  functions  are  available  only  for  the 
final  MU-term  model  (see  Section  5.1). 

The  value  of  the  goodness-of-fit  criterion,  FL2,  1 12  (2)  for  the  final  MU-term  model 
can  be  obtained  by  executing  the  statement 
FL2  =  GOF(SMOD,  MU). 

The  array  SMOD  must  be  the  same  as  in  the  call  to  SMARTR  (Section  5.1)  or  SMARTC 
(Section  5.2).  The  output  quantity,  MU,  is  the  number  of  terms  of  the  final  (user  specified) 
model. 

The  parameter  vectors  aym  (1  <  j  <  p)  and  /3,m  (1  <  *  <  q)  can  be  obtained  for  each 
term,  m,  by  executing  the  statement 

FP  =  GTPRMS(ITERM,  SMOD,  A,  B). 

The  array  SMOD  must  be  the  same  as  in  the  call  to  SMARTR  or  SMARTC.  The  input 
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quantity  ITERM  is  the  term  number  for  which  the  parameters  are  desired.  The  parameters 
are  stored  in  the  real  arrays  A  and  B: 

REAL  A(P):  parameters  of  predictor  linear  combinations 

A(J)=  ajm  (J,j=l,p) 

REAL  B(Q):  parameters  of  the  response  linear  combinations 

B(I)  (/,*=!,?) 

and  ITERM  =  m. 

The  quantity  FP  has  the  value  1.0  if  1  <  ITERM  <  MU  and  0.0  otherwise.  If  FP  =  0.0, 
then  no  values  are  stored  in  the  output  arrays. 

Each  function  /m(a^x)  is  represented  by  a  set  of  matched  pairs  ( fmk,tmk ),  1  <  k  < 
N,  one  pair  for  each  observation.  Here  fmk  =  fm(tmk)  with  tmk  =  amxk-  These  pairs  can 
be  plotted  as  points  on  an  available  graphics  device  with  fm  as  the  ordinate  and  tm  the 
abscissa.  The  points  representing  the  function  for  each  term,  m,  are  obtained  by  executing 
the  statement 

FP  =  GTFUN(ITERM,  SMOD,  F,  T). 

The  quantities  FP,  ITERM,  SMOD  have  the  same  meaning  as  with  GTPRMS  described 
in  the  preceding  paragraph.  The  function  is  stored  in  the  two  output  arrays: 

REAL  F(N):  ordinates,  F(K)  is  the  ordinate  value  for  the  Kth  observation 
Real  T(N):  abscissas,  T(K)  is  the  abscissa  value  for  the  Kth  observation 
where  K=l,  N.  (Note  that  the  observations  are  labeled  here  in  increasing 
order  of  T(K)  rather  than  in  their  order  in  the  data  matrix.) 

6.3  Third  Level  Output 

In  addition  to  the  output  obtainable  under  program  control,  SMART  also  writes 
information  to  the  standard  output  file  IFL  (provided  IFL>0,  see  Section  5.3).  This 
information  can  help  with  model  selection  and  in  interpretation  of  the  selected  final  model. 
For  this  output  the  goodness-of-fit  is  always  expressed  in  terms  of  fraction  of  unexplained 
variance  defined  as 

#  =  L,lY.wiWi-Yi il2  (i°) 

*=1 
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with  Y i  =  EYi  and  X2  given  by  (2).  Note,  however,  that  FUNCTION  GOF  (Section  6.2) 
returns  the  value  of  X2,  not  e2,  for  the  final  model. 

This  third  level  output  consists  of  a  listing  of  all  of  the  (Af-term)  solutions  MU  <  M  < 
ML ,  in  order  of  decreasing  value  of  M.  Each  solution  is  represented  by  the  {aym,/3im}, 
(1  <  «  <  g,  1  <  j  <  p),  for  each  term  m  (1  <  m  <  Af).  The  a ym  are  given  in  order 
of  increasing  j  {A  =)  and  the  /?,-m  in  order  of  increasing  i(B  =).  The  value  of  e2  for  a 
solution  precedes  the  parameter  listings  of  its  terms. 

Following  the  term  parameter  listings  of  a  particular  solution  is  a  listing  of  the  relative 
importance  of  each  term  Jm  (8)  (1  <  m  <  M).  The  starting  parameter  values  in  the 
numerical  search  for  the  next  smaller  (Af  —  1  term)  model  are  the  solution  values  for  the 
M  —  1  most  important  terms  of  this  (Af-term)  model. 

Following  the  relative  term  importance  listing  is  a  listing  of  the  fraction  of  unexplained 
variance  e2  for  each  response  Y{  (1  <  »  <  q)  separately.  Here 

ef  =  E[Yi  -Yi-Y,  -  Yi}\ 

m=l 


this  output  does  not  appear  if  there  is  only  one  response  variable  ( q  =  1). 

For  classification  there  are  two  additional  quantities  listed  with  each  (Af-term)  solu¬ 
tion.  These  are  two  different  estimates  of  the  misclassification  risk  associated  with  using 
this  model  for  the  conditional  expectations  in  a  minimum  risk  decision  rule  (6).  The 
first  estimate  Ri  (MISCLASSIFICATION  RISK)  is  obtained  by  classifying  each  training 
observation  k  (1  <  k  <  N)  using  the  minimum  loss  rule  (6) 


=  *i*"**pJ} 


in) 


and  then  computing  the  risk  by  averaging  the  loss  associated  with  the  resulting  misclassi- 
fications 

S‘  =  E“»E^;'  <12) 


fc=l  «=i 


The  second  estimate  (CALCULATED  FROM  PROBABILITY  ESTIMATES)  is  the 
value  of  R  (6)  computed  by  substituting  the  conditional  expectation  estimates  of  this 
(Af-term)  model  directly  into  (6). 
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To  the  extent  that  the  conditional  expectation  (probability)  estimates  are  accurate 
these  two  risk  estimates  should  have  similar  values.  Note  that  Ri  is  nearly  always  less  than 
Rz-  However,  it  is  often  possible  to  do  accurate  classification  in  the  presence  of  very  poor 
probability  estimates.  This  is  especially  true  for  the  simple  loss  structure  =  1  —  S(i,j) 
where  it  is  only  necessary  to  correctly  estimate  which  class  has  the  highest  probability  given 
x\  •  •  •  xp.  The  probability  values  themselves  or  even  their  order  (except  for  the  largest)  are 
not  needed  in  this  case. 

Comparing  the  values  of  Ri  and  R^  gives  some  indication  of  how  well  the  model 
conditional  expectation  estimates  are  approximating  the  true  underlying  probabilities.  If 
Ri  is  much  smaller  than  R?  (which  is  often  the  case)  then  the  probability  estimates  are 
not  too  close.  In  this  case  some  care  should  be  exercised  in  interpreting  the  values  of 
YH(I)  (!</<<?)  and  RSK  as  returned  by  CLSFY  (see  Section  6.1). 

The  quantities  described  (so  far)  in  this  section  are  listed  for  each  M-  term  solution 
(MU  <  M  <  ML).  The  final  MU- term  solution  is  the  SMART  model  relevant  to  the 
output  described  in  Sections  6.1  and  6.2.  The  relative  importance  of  each  predictor  variable 
/y  (1  <  j  <  P )  (9)  (standardized  so  that  the  most  important  variable  has  unit  importance) 
is  also  computed  for  this  last  MU -term  SMART  model  and  listed  at  the  end  of  the  standard 
output. 
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Appendix  I 


Numerical  optimization  of  least  squares  criterion  for  SMART  models 

This  section  discusses  the  minimization  of  £2  (2)  simultaneously  with  respect  to  ajm 
(1  <  j  <  P)  >  Pim  (1  <  *  <  g)  and  the  functions  fm{  1  <  m  <  M)  for  a  given  number  of 
terms  M.  An  alternating  optimization  strategy  is  used.  The  parameters  are  grouped  such 
that  the  solution  for  those  in  each  group  is  straightforward  given  fixed  values  for  those 
outside  the  group.  A  solution  is  obtained  for  the  variables  in  the  group  and  these  solution 
values  replace  their  current  parameter  values.  Attention  is  then  focused  on  the  next  group 
and  this  process  repeated  for  its  parameters.  After  solutions  have  been  obtained  for  all 
groups  of  parameters,  another  pass  is  made  over  the  groups  obtaining  new  solution  values 
given  the  new  values  for  the  parameters  outside  each  group  obtained  in  the  previous  pass. 
These  passes  are  repeated  until  the  loss  criterion  £3  (2)  fails  to  decrease  on  two  consecutive 
passes.  Usually  a  threshold  e  is  set  at  a  small  value  and  if  improvement  on  two  consecutive 
passes  is  less  than  e,  iterations  are  stopped  and  the  parameter  values  at  that  point  taken 
as  the  solution.  Since  at  each  step  in  this  process  £2  is  made  smaller  through  a  partial 
minimization,  and  £2  >  0,  the  alternating  optimization  must  converge  (provided  e  is  large 
compared  to  the  numerical  accuracy  of  the  computer’s  arithmetic).  However,  there  is  no 
guarantee  that  the  solution  is  the  global  minimum  of  £3.  It  may  be  a  local  minimum. 
Strategy  for  dealing  with  this  problem  in  the  context  of  SMART  modeling  is  discussed  in 
Section  3. 

The  parameter  grouping  used  in  the  SMART  algorithm  is  hierarchical.  The  first  level 
grouping  is  by  term.  The  parameters  ajm(  1  <  j  <  p),  /?,m(  1  <  t  <  q)  and  the  function 
fm  (for  fixed  m)  form  each  group.  There  are  obviously  M  such  groups.  At  the  second 
level  the  parameters  of  each  term  are  divided  into  three  groups:  the  ajm  (1  <  j  <  p)  form 
the  first  (sub)  grouping,  the  /?tm  (1  <  *  <  q)  form  the  second  and  the  function  fm  forms 
the  third. 

Consider  a  particular  term,  k  (1  <  k  <  M).  The  loss  criterion  (2)  can  be  reexpressed 
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as 


l2  -  4k)  =  Y,Wi  J5[iJ,w  - *»/»(<£*)]' 


1=1 


with 


Ri(k)  =  Yi  -Yi~J2 

m^k 


(Al) 


(A2J 


Equation  Al  isolates  the  kth  term’s  contribution  to  the  criterion.  Following  the  alter¬ 
nating  optimization  strategy  we  minimize  L 2  (L^)  with  respect  to  the  parameters  of  the 
kth  term.  These  parameter  values  are  then  used  to  help  define  Ri{h')tkf  ^  k,  to  obtain 
new  solutions  for  the  parameters  of  other  terms.  Repeated  passes  are  made  over  all  the 
terms  until  convergence  (Z2  stops  decreasing-see  above). 

We  now  focus  on  obtaining  solutions  for  the  parameters  of  the  kth  term  given  R^k) 
(A2).  The  solutions  for  the  /?,*  (given  fk  and  a^)  are  straightforward 


E[fk(otZx)\ 


(A3) 


(Remember  that  i?[.R,(fc)]  —  E[fk{c '%x)]  =  0.) 

The  solution  for  the  function  fk  (given  /?,*  and  »£)  is  almost  as  easily  obtained. 
Reexpressing  L ^  (Al)  as 


LP  =  Ear  |  afx], 


(>4) 


1=1 


we  see  that  it  is  minimized  if  fk  is  chosen  to  minimize  the  conditional  expectation  in  A4 
for  each  value  of  a  £  x.  This  is  accomplished  by 

<1  q 


/*(<■£*)  =  | 


(A5) 


*=1 


»=1 


Since  we  require  Efk  =  0  and  Efy.  =  1,  we  standardize  /£,  making  the  denominator  in 
(A5)  irrelevant. 

It  remains  to  find  a  solution  that  minimizes  L ^  (Al)  with  respect  to  a 1  =  (aifc, 
&2k) •  •  •  Qfpfc)  given  values  for  /?,■*  (1  <  i  <  q )  and  a  (fixed)  function  /*.  Unlike  the  other 
parameters  (/?,*  and  fk),  ocT  does  not  enter  in  a  purely  quadratically  way  into  the  loss 
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criterion.  Therefore,  solutions  may  not  be  unique,  and  they  cannot  be  obtained  in  a  single 
step.  An  iterative  numerical  optimization  must  be  performed. 

The  loss  criterion  i2  (2,  Al)  can  be  expressed  in  the  generic  form 


£*(<**)  =  £wi£fos(a*)]s 


1  =  1 


with 


=  (Ri(k)  ~  Pikfk{c*kxj) 


(A6) 


(A7) 


The  classical  numerical  optimization  technique  for  criteria  of  the  form  (A6)  is  the  Gauss- 
Newton  method  (see  Gill,  Murray  and  Wright,  1981,  Section  4.7).  Let  aj^T  = 

(arfc  > '  "  >  apfc^)  be  a  trial  set  of  values  at  some  point  during  the  optimization.  The  Gauss- 
Newton  estimate  for  the  solution  a £  (the  next  set  of  trial  values  in  the  iterative  process)  is 
ak  =  ai°^T  +  A T  where  the  vector  Ar  is  the  solution  to  the  set  of  simultaneous  equations 


dgi 


i=l 


dak  dctk 


*= l 


dak 


(AS) 


The  function  gi  and  the  vector  of  partial  derivatives  are  evaluated  at  aj^K  Prom  A7  one 
has 

J^(«l0))  =  -0ikfk(*k)T* 0*  (^9) 

where  f\z )  =  df /dz.  After  solving  (A8)  for  A,  ak  replaces  and  the  process  can  be 
repeated  until  convergence  (L2  stops  decreasing). 

It  is  possible  that  a  Gauss-Newton  step  fails  to  decrease  L2  (L2(aj^0*  +  A)  >  L2(a^0*)). 
In  this  case  the  step  is  cut  in  half  ( ak  =  +  A/2).  If  this  new  step  still  results  in  an 

increase  in  L2,  the  step  is  cut  again  (a*  =  +  A/4).  This  repeated  cutting  of  the  step 

is  continued  until  £2  decreases.  Since  the  matrix  on  the  left-hand-side  of  (A8)  is  positive 
definite,  A  =  A/  j  A  j  is  a  valid  descent  direction  and  at  some  point  the  step  halving  must 
give  rise  to  a  decrease  in  L2  (unless  a represents  a  minimum  of  L2). 

As  discussed  in  Section  6.2,  the  functions  fk(akx)  are  stored  as  an  ordinate  and 
abscissa  value  for  each  observation.  The  derivative  estimates  f'k{oi^x)  are  similarity  stored 
(see  below).  These  values  are  obtained  when  fk(akx )  is  evaluated  (A5).  When  is 
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changed  to  a%  (via  Gauss-Newton  update),  an  interpolation  scheme  must  be  employed  to 
obtain  values  for  fk  (a£z)  from  fk(ct^T  x).  This  interpolation  is  almost  as  expensive  as 
obtaining  the  optimal  function  for  the  new  argument  oi%x.  We,  therefore,  do  not  iterate 
the  Gauss-Newton  stepping  until  convergence  for  a  given  function,  but  rather  take  only 
a  single  step.  A  new  (optimal)  function  /fc[(ot^°^r  +  AT)a:]  (A5)  is  evaluated,  and  the 
next  Gauss-Newton  step  (A7-A9)  is  made  based  on  this  new  function.  Step  cutting,  as 
described  above,  is  employed  for  bad  steps.  In  this  way  both  the  function  and  the  predictor 
linear  combination  for  the  k  —  th  term  are  simultaneously  optimized  by  the  Gauss-Newton 
iteration  procedure. 

The  expected  values  2?[-]  are  easily  evaluated  via  (3).  The  conditional  expectation 
estimates  ( A5)  for  evaluation  of  the  optimal  functions  are  more  difficult.  The  method  used 
here  is  described  in  detail  in  Friedman  (1984).  The  derivative  estimates  (9  and  A9)  are 
made  by  taking  first  differences  of  the  function  estimates 

l)  =  [A(a^l).-A(^'-l)l  (2  <  I  <  JV  —  1)  (.410) 

<*k\x*+ 1  -  *1-1 ) 

where  the  n  are  labeled  in  increasing  order  of  a^x.  Endpoints  (/  =  1  and  /  =  N) 
are  handled  by  simply  copying  the  values  of  their  nearest  neighbors.  Such  estimates  can 
become  unstable  if  the  denominator  becomes  too  small.  This  can  be  avoided  by  pooling 
observations  for  which 


|  ctk{xi  -  xi>)  |  <  el  (1  <l,l'<N)  (All) 

into  a  single  observation  for  the  purpose  of  derivative  calculation.  Here  I  is  the  semi- 
interquartile  range  of  a%x  and  €  is  a  small  number  ( e  ~  0.05).  This  pooling  can  be  done 
rapidly  by  using  a  method  similar  to  the  pooled- adjacent-violators  algorithm  for  isotone 
regression  (Kruskal,  1964). 

The  SMART  program  provides  the  user  some  control  over  the  optimization  process. 
This  control  is  exercised  through  the  parameter  LF  in  the  / P ARMS/  common  block  (see 
Section  5.3).  This  parameter  can  take  integer  values  between  zero  and  three  (default, 
LF  =  2).  Level  three  (LF  =  3)  optimization  is  that  which  is  described  above  in  this 
section.  The  other  (lower)  levels  induce  some  shortcuts  in  the  optimization  procedure. 
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Level  two  ( LF  =  2)  optimization  has  the  same  goal  as  level  three:  that  is,  minimize 
L2  (2)  with  respect  to  all  parameters  and  functions.  The  two  levels  differ  only  in  strategy. 
With  level  two  only  one  Gauss-Newton  step  (for  the  ay&,  1  <  j  <  p)  and  corresponding 
function  fk  optimization  is  performed  for  each  new  set  of  /?,*  (1  <  *  <  q)  in  the  iteration 
loop  for  the  kth  term  (see  Al),  rather  than  completely  optimizing  with  respect  to  a3k  and 
fk  for  each  new  set  of  /?,*.  (For  single  response  regression  (q  =  1)  the  two  strategies  are 
equivalent.)  Level  two  optimization  is  usually  faster  but  a  little  less  robust  (more  easily 
trapped  at  saddle  points)  than  level  three. 

The  two  lowest  levels  of  optimization,  levels  zero  and  one,  actually  construct  different 
models.  These  models  have  the  same  form  as  SMART  models  (1)  but  the  parameter  and 
function  estimates  are  obtained  by  partially  rather  than  completely  minimizing  L2  (2). 
Level  zero  (LF  =  0)  implements  a  purely  stagewise  optimization  strategy.  At  each  stage 
the  loss  criterion  L2  (2)  is  minimized  only  with  respect  to  the  parameters  and  function 
of  the  Mth  term,  Pm  (1  <  »  <  q),  cicjM  (1  <  j  <  p)ifM(<*Mx)i  given  the  previously 
established  values  for  the  corresponding  quantities  in  the  earlier  terms  (1  <  m  <  M  —  1). 
The  M  term  model  consists  of  the  newly  established  estimates  at  the  Mth  stage  as  well 
as  those  for  the  previous  ( M  —  1  term)  model. 

Level  one  optimization  (LF  =  1)  represents  a  compromise  between  a  purely  stagewise 
strategy  (level  zero)  and  complete  least  squares  (levels  two  and  three).  Here  the  estimates 
for  the  predictor  linear  combinations  (a^)  are  obtained  in  a  stagewise  manner  as  described 
above.  However,  the  M  term  model  is  obtained  by  completely  minimizing  L2  (2)  with 
respect  to  the  /?,m  (1  <  i  <  q)  and  /m,  given  the  stagewise  estimates  for  the  or^  (1  < 
m  <  M).  For  a  single  response  (g  =  1),  level  one  optimization  is  similar  to  the  procedure 
employed  in  PPR  modeling  (Friedman  and  Stuetzle,  1981).  The  only  differences  are  in 
the  backwards  stepwise  model  selection  procedure  (see  Section  3)  used  by  SMART,  and 
in  the  use  of  the  Gauss-Newton  (rather  than  a  Rosenbrock)  procedure  for  the  numerical 
optimization. 

The  principal  advantage  of  level  zero  or  one  optimization  over  complete  least  squares 
(levels  two  or  three)  is  computation  speed.  If  this  is  a  problem,  then  the  lower  optimization 
levels  can  be  used  to  rapidly  obtain  models  that  are  often  quite  competitive  with  the 
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full  least  squares  solution.  This  is  especially  true  when  there  is  not  a  high  degree  of 
association  among  the  predictor  variables.  Also,  the  lower  optimization  levels  can  be 
useful  for  exploratory  work. 
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